Lab 4: Modeling correlation and regression

Practice session covering topics discussed in Lecture 3

M. Chiara Mimmi, Ph.D. | Università degli Studi di Pavia

July 27, 2024

GOAL OF TODAY’S PRACTICE SESSION

  • Review the basic questions we can ask about ASSOCIATION between any two variables:
    • does it exist?
    • how strong is it?
    • what is its direction?
  • Introduce a widely used analytical tool: REGRESSION
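
As a minimal sketch of how the three questions above map onto R output, the example below runs stats::cor.test on simulated data. The variables x and y and all numbers are invented for illustration; they are not taken from the lab datasets.

```r
# Simulated data (hypothetical): y increases with x, plus random noise
set.seed(123)
x <- rnorm(100, mean = 50, sd = 10)
y <- 0.5 * x + rnorm(100, mean = 0, sd = 5)

# Pearson correlation test (parametric)
pearson <- cor.test(x, y, method = "pearson")

pearson$p.value         # does it exist? (a small p-value is evidence of association)
abs(pearson$estimate)   # how strong is it? (|r| close to 1 means strong)
sign(pearson$estimate)  # what is its direction? (+1 positive, -1 negative)

# Spearman rank correlation (non-parametric alternative)
spearman <- cor.test(x, y, method = "spearman")
spearman$estimate
```

The same call pattern applies to any pair of numeric columns in a data frame, for example cor.test(df$a, df$b).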



The examples and code from this lab session follow very closely the open access book:

Topics discussed in Lecture # 3


  • Testing and summarizing the relationship between two variables (correlation)
    • Pearson’s 𝒓 analysis (parametric)
    • Spearman test (non-parametric)
  • Measures of association
    • Chi-Square test of independence
    • Fisher’s Exact Test
      • alternative to the Chi-Square Test of Independence
  • From correlation/association to prediction/causation
    • The purpose of observational and experimental studies
  • Widely used analytical tools
    • Simple linear regression models
    • Multiple Linear Regression models
  • Shifting the emphasis to empirical prediction
    • Introduction to Machine Learning (ML)
    • Distinction between Supervised & Unsupervised algorithms

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We will also use the packages below (specifying package::function for clarity).
# Load them for this R session

# General 
library(fs)      # file/directory interactions
library(here)    # tools to find your project's files, based on working directory
library(paint)   # paint data.frame summaries in colour
library(janitor) # tools for examining and cleaning data
library(dplyr)   # {tidyverse} tools for manipulating and summarizing tidy data 
library(forcats) # {tidyverse} tool for handling factors
library(openxlsx) # Read, Write and Edit xlsx Files
library(flextable) # Functions for Tabular Reporting
# Statistics
library(rstatix) # Pipe-Friendly Framework for Basic Statistical Tests
library(lmtest) # Testing Linear Regression Models
library(broom) # Convert Statistical Objects into Tidy Tibbles
library(tidymodels) # not installed on this machine
library(performance) # Assessment of Regression Models Performance 
# Plotting
library(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphics

DATASETS for today

We will use examples (with adapted datasets) from real clinical studies, provided among the learning materials of the open access books:

Importing Dataset 1 (NHANES)

Name: NHANES (National Health and Nutrition Examination Survey) combines interviews and physical examinations to assess the health and nutritional status of adults and children in the United States. Started in the 1960s, it became a continuous program in 1999.
Documentation: dataset1
Sampling details: Here we use a sample of 500 adults from NHANES 2009-2010 & 2011-2012 (nhanes.samp.adult.500 in the R oibiostat package, which has been adjusted so that it can be viewed as a random sample of the US population)

  • Adapt the call to here::here to match your own folder structure
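
A minimal sketch of such an import is shown below. The folder names ("practice", "data_input"), the file name ("nhanes_samp.csv"), and the CSV format are all assumptions made for illustration; replace them with the actual location and format of the file in your project.

```r
# Build a path relative to the project root
# (folder and file names are hypothetical: adjust to your own layout)
nhanes_path <- here::here("practice", "data_input", "nhanes_samp.csv")

# Read the file only if it exists, otherwise report where R looked for it
if (file.exists(nhanes_path)) {
  nhanes_samp <- read.csv(nhanes_path, header = TRUE)
  str(nhanes_samp)
} else {
  message("File not found: ", nhanes_path)
}
```

Building the path with here::here rather than a hard-coded absolute path keeps the script portable across machines.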

NHANES Variables and their description

[EXCERPT: see complete file in Input Data Folder]


Further reading on interpreting regression output in R:

  • https://www.statology.org/interpret-regression-output-in-r/
  • https://www.youtube.com/watch?v=ebHLMyqC2UY
  • https://www.learnbymarketing.com/tutorials/explaining-the-lm-summary-in-r/

Splitting the sample

If we want to assess how well a model relating an explanatory variable to an outcome variable predicts new data, we should split our sample to have:

  1. one sample to "train" a model on
  2. one sample to "test" the model on
  • Otherwise our model will seem better than it is (since it will be specifically built to “fit” our data)

  • The function rsample::initial_split will assist in that
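
The split described above can be sketched as follows. Because the lab data are not reproduced here, the example uses a simulated stand-in (nhanes_sim, with invented Age and BMI values), and the 80/20 proportion is an assumption rather than the value used in the lab.

```r
# Simulated stand-in for the NHANES sample (hypothetical values)
set.seed(2024)
nhanes_sim <- data.frame(Age = round(runif(500, min = 18, max = 80)))
nhanes_sim$BMI <- 25 + 0.1 * nhanes_sim$Age + rnorm(500, mean = 0, sd = 6)

# Keep 80% of the rows for training and 20% for testing (assumed proportion)
nhanes_split <- rsample::initial_split(nhanes_sim, prop = 0.8)
nhanes_train <- rsample::training(nhanes_split)  # sample to "train" the model on
nhanes_test  <- rsample::testing(nhanes_split)   # sample to "test" the model on

nrow(nhanes_train)
nrow(nhanes_test)
```

Calling set.seed before the split makes the random assignment of rows reproducible.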

Linear regression performance

We can start looking at how the model performs by applying it to our nhanes_test sub-sample, utilizing the function predict
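
A minimal sketch of this step, assuming a simple model of BMI on Age fitted to simulated data. The variable names, the simulated values, and the base R split (used here only to keep the sketch self-contained) are illustrative assumptions, not the lab's actual model.

```r
# Simulated stand-in for the NHANES sample (hypothetical values)
set.seed(2024)
nhanes_sim <- data.frame(Age = round(runif(500, min = 18, max = 80)))
nhanes_sim$BMI <- 25 + 0.1 * nhanes_sim$Age + rnorm(500, mean = 0, sd = 6)

# Base R split into 400 training rows and 100 testing rows
train_rows   <- sample(nrow(nhanes_sim), size = 400)
nhanes_train <- nhanes_sim[train_rows, ]
nhanes_test  <- nhanes_sim[-train_rows, ]

# Fit the model on the training sub-sample only
lm_fit <- lm(BMI ~ Age, data = nhanes_train)

# Apply the fitted model to the test sub-sample
nhanes_test$BMI_pred <- predict(lm_fit, newdata = nhanes_test)
head(nhanes_test)
```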

Linear regression performance: predicted values in test sample

  • We can look at the 95% CI of all predicted values
  • We can look at the 95% CI of a single predicted value
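
Both cases can be sketched with the interval argument of predict. The data and the BMI on Age model are simulated assumptions, and since "CI" may refer either to a confidence interval for the mean response or to a prediction interval for a new observation, the example shows both.

```r
# Simulated stand-in for the NHANES sample (hypothetical values)
set.seed(2024)
nhanes_sim <- data.frame(Age = round(runif(500, min = 18, max = 80)))
nhanes_sim$BMI <- 25 + 0.1 * nhanes_sim$Age + rnorm(500, mean = 0, sd = 6)
train_rows   <- sample(nrow(nhanes_sim), size = 400)
nhanes_train <- nhanes_sim[train_rows, ]
nhanes_test  <- nhanes_sim[-train_rows, ]
lm_fit <- lm(BMI ~ Age, data = nhanes_train)

# 95% confidence intervals for every row of the test sample
pred_ci <- predict(lm_fit, newdata = nhanes_test,
                   interval = "confidence", level = 0.95)
head(pred_ci)  # columns: fit (prediction), lwr and upr (interval bounds)

# 95% confidence interval for a single hypothetical person aged 40
one_ci <- predict(lm_fit, newdata = data.frame(Age = 40),
                  interval = "confidence", level = 0.95)

# 95% prediction interval for the same person: wider, because it also
# accounts for the variability of an individual observation
one_pi <- predict(lm_fit, newdata = data.frame(Age = 40),
                  interval = "prediction", level = 0.95)
one_ci
one_pi
```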

Linear regression performance: RMSE

Basically we are asking: “how does the prediction compare to the actual test dataset?”

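For this we take the difference between each predicted value and the corresponding actual value, square these differences, average them, and take the square root:

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2} \]

RMSE = Root Mean Squared Error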

The RMSE is quite close to the residual standard error that we got from the regression model summary (6.843), even though the latter was computed on the training data and the RMSE comes from the testing data.
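
A minimal sketch of this comparison on simulated data is given below. The data, the BMI on Age model, and the resulting numbers are illustrative assumptions and will not reproduce the value 6.843 from the lab.

```r
# Simulated stand-in for the NHANES sample (hypothetical values)
set.seed(2024)
nhanes_sim <- data.frame(Age = round(runif(500, min = 18, max = 80)))
nhanes_sim$BMI <- 25 + 0.1 * nhanes_sim$Age + rnorm(500, mean = 0, sd = 6)
train_rows   <- sample(nrow(nhanes_sim), size = 400)
nhanes_train <- nhanes_sim[train_rows, ]
nhanes_test  <- nhanes_sim[-train_rows, ]
lm_fit <- lm(BMI ~ Age, data = nhanes_train)

# RMSE on the test sample: square root of the mean squared prediction error
pred      <- predict(lm_fit, newdata = nhanes_test)
rmse_test <- sqrt(mean((pred - nhanes_test$BMI)^2))
rmse_test

# Residual standard error of the model, computed on the training sample
rse_train <- sigma(lm_fit)
rse_train
```

The two quantities differ slightly in their denominators (n for the RMSE, the residual degrees of freedom for the residual standard error), so they are comparable but not identical by construction.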

Linear regression performance: \(R^2\)
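
One way to compute \(R^2\) on the test sample is sketched below, using the definition \(R^2 = 1 - SSE/SST\). The simulated data and the BMI on Age model are assumptions for illustration; other definitions (for example the squared correlation between predicted and actual values) are also in use and can give different results.

```r
# Simulated stand-in for the NHANES sample (hypothetical values)
set.seed(2024)
nhanes_sim <- data.frame(Age = round(runif(500, min = 18, max = 80)))
nhanes_sim$BMI <- 25 + 0.1 * nhanes_sim$Age + rnorm(500, mean = 0, sd = 6)
train_rows   <- sample(nrow(nhanes_sim), size = 400)
nhanes_train <- nhanes_sim[train_rows, ]
nhanes_test  <- nhanes_sim[-train_rows, ]
lm_fit <- lm(BMI ~ Age, data = nhanes_train)

# R^2 on the test sample: share of the test-set variance explained by the model
pred <- predict(lm_fit, newdata = nhanes_test)
sse  <- sum((nhanes_test$BMI - pred)^2)                     # unexplained variation
sst  <- sum((nhanes_test$BMI - mean(nhanes_test$BMI))^2)    # total variation
r2_test <- 1 - sse / sst
r2_test

# R^2 reported by the model summary (computed on the training sample)
r2_train <- summary(lm_fit)$r.squared
r2_train
```

With this definition the test-sample value can be negative when the model predicts worse than the test-sample mean.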

Linear regression models outputs: model fit
